
initial work to reconcile devices from launcher pods #91

Merged

Conversation

@ibrokethecloud ibrokethecloud commented Sep 13, 2024

IMPORTANT: Please do not create a Pull Request without creating an issue first.

Problem:

Rancher allows provisioning downstream clusters that leverage vGPUs.

When used with a machine deployment of more than one node, the GPU actually allocated can differ from the name specified, because KubeVirt uses the DeviceName field to calculate the launcher pod's resource requirements, which the scheduler then uses to identify suitable nodes.

Since the device name itself is not used for allocation, any random string can be supplied, which makes it difficult to track vGPU allocation in the cluster.

Solution:

This PR introduces a minor change that leverages the pod environment variables set by the device plugin during ContainerAllocateResponse. An additional VMI controller execs into the launcher pod to identify the device ID set for each resource and maps it back to the corresponding GPU/host-device resource. Once the devices are identified, an annotation is set on the VM recording which devices, from the pool of devices passed through to the VM, were actually allocated.
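As background, the device-plugin side of this handoff can be sketched roughly as below: during Allocate, the plugin attaches an environment variable to the launcher container carrying the IDs of the devices actually handed out. The function name and environment-variable key are illustrative assumptions, not the actual pcidevices code.

```go
package deviceplugin

import (
	"strings"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// buildAllocateResponse sketches how a device plugin can surface the allocated
// device IDs to the launcher pod via an environment variable. The env key
// (e.g. "PCIDEVICE_INTEL_COM_8D26") and its value format are hypothetical.
func buildAllocateResponse(envKey string, deviceIDs []string) *pluginapi.ContainerAllocateResponse {
	return &pluginapi.ContainerAllocateResponse{
		Envs: map[string]string{
			// value would look like "vgpu-01-0000001d0" for a single device
			envKey: strings.Join(deviceIDs, ","),
		},
	}
}
```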

For example, for a host device the annotation looks as follows:

harvesterhci.io/deviceAllocationDetails: '{"hostdevices":{"intel.com/8d26":["vgpu-01-0000001d0"]}}'

For GPU devices it looks as follows:

harvesterhci.io/deviceAllocationDetails: '{"gpus":{"nvidia.com/NVIDIA_A2-2Q":["vgpu-01-000008005"]}}'

When a VM is shut down, the annotation is used to replace the placeholder names of vGPU and host devices with the actual device names from the deviceAllocationDetails annotation. This ensures the VM can still be edited and devices removed from the Harvester UI post-provisioning.
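A minimal sketch of that stop-time reconcile, assuming the annotation has already been parsed into per-resource maps of allocated device names (function and parameter names are illustrative, not the actual pcidevices implementation):

```go
package virtualmachine

import (
	kubevirtv1 "kubevirt.io/api/core/v1"
)

// restoreDeviceNames swaps the placeholder Name on each vGPU and host device
// in the VM spec for a real device name recorded in the annotation.
// gpuAlloc/hostDevAlloc map a resource name (DeviceName) to the list of
// allocated device names parsed from deviceAllocationDetails.
func restoreDeviceNames(vm *kubevirtv1.VirtualMachine, gpuAlloc, hostDevAlloc map[string][]string) {
	if vm.Spec.Template == nil {
		return
	}
	devices := &vm.Spec.Template.Spec.Domain.Devices

	for i := range devices.GPUs {
		gpu := &devices.GPUs[i]
		if names := gpuAlloc[gpu.DeviceName]; len(names) > 0 {
			gpu.Name = names[0]                  // take the next unclaimed device name
			gpuAlloc[gpu.DeviceName] = names[1:] // so the same name is not reused twice
		}
	}

	for i := range devices.HostDevices {
		hd := &devices.HostDevices[i]
		if names := hostDevAlloc[hd.DeviceName]; len(names) > 0 {
			hd.Name = names[0]
			hostDevAlloc[hd.DeviceName] = names[1:]
		}
	}
}
```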

Related Issue:

Test plan:

@ibrokethecloud (Collaborator, Author) commented:

backend changes related to: rancher/dashboard#11399

@Yu-Jack Yu-Jack (Contributor) left a comment:

LGTM, and just a reminder: we need to raise a chart PR to update the service account permissions so that the pcidevices-controller can list/update virtualmachines and list virtualmachineinstances.
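The required RBAC addition, sketched here as Go rule definitions purely for illustration (the actual chart change may structure the verbs and rules differently):

```go
package rbac

import rbacv1 "k8s.io/api/rbac/v1"

// additionalRules captures the permissions described above:
// list/update on virtualmachines and list on virtualmachineinstances.
var additionalRules = []rbacv1.PolicyRule{
	{
		APIGroups: []string{"kubevirt.io"},
		Resources: []string{"virtualmachines"},
		Verbs:     []string{"list", "update"},
	},
	{
		APIGroups: []string{"kubevirt.io"},
		Resources: []string{"virtualmachineinstances"},
		Verbs:     []string{"list"},
	},
}
```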

pkg/controller/virtualmachine/virtualmachine.go — 3 review threads (outdated, resolved)
@WebberHuang1118 WebberHuang1118 (Member) left a comment:
LGTM, just a nit, thanks.

@WebberHuang1118 WebberHuang1118 (Member) left a comment:

LGTM, thanks for the PR

initial work to reconcile devices from launcher pods

refactor vmi reconcile to pass codefactor check

refactor code and change validation logic to use new annotations

modified vmi/vm reconcile to also update vm with correct device names based on vgpu annotation when VM is stopped

include pr review feedback

fine tune logic for removing duplicates from allocation annotation
@ibrokethecloud ibrokethecloud merged commit b913aa8 into harvester:master Oct 2, 2024
5 checks passed
ibrokethecloud added a commit to harvester/charts that referenced this pull request Oct 6, 2024
ibrokethecloud added a commit to ibrokethecloud/harvester-charts that referenced this pull request Oct 7, 2024
ibrokethecloud added a commit to harvester/charts that referenced this pull request Oct 7, 2024